ABSTRACT

Some complex problems, such as image tagging and natural language
processing, are very challenging for computers, and even
state-of-the-art technology is not yet able to provide satisfactory
accuracy. Therefore, rather than relying solely on developing new and
better algorithms to handle such tasks, we look to the crowdsourc- 
ing solution - employing human participation - to make good the 
shortfall in current technology. Crowdsourcing is a good supple- 
ment to many computer tasks. A complex job may be divided into 
computer-oriented tasks and human-oriented tasks, which are then 
assigned to machines and humans respectively. 

To leverage the power of crowdsourcing, we design and imple- 
ment a Crowdsourcing Data Analytics System, CDAS. CDAS is a 
framework designed to support the deployment of various crowd- 
sourcing applications. The core part of CDAS is a quality-sensitive 
answering model, which guides the crowdsourcing engine to pro- 
cess and monitor the human tasks. In this paper, we introduce the 
principles of our quality-sensitive model. To satisfy user required 
accuracy, the model guides the crowdsourcing query engine for the 
design and processing of the corresponding crowdsourcing jobs. 
It provides an estimated accuracy for each generated result based 
on the human workers' historical performances. When verifying 
the quality of the result, the model employs an online strategy to 
reduce waiting time. To show the effectiveness of the model, we 
implement and deploy two analytics jobs on CDAS, a Twitter sentiment
analytics job and an image tagging job. We use real Twitter
and Flickr data as our queries respectively. We compare our ap- 
proaches with state-of-the-art classification and image annotation 
techniques. The results show that the human-assisted methods can 
indeed achieve a much higher accuracy. By embedding the quality- 
sensitive model into crowdsourcing query engine, we effectively 
reduce the processing cost while maintaining the required query 
answer quality. 

1. INTRODUCTION 

Figure 1: Crowdsourcing Application

Crowdsourcing is widely adopted in Web 2.0 sites. For example,
Wikipedia benefits from thousands of subscribers, who continually
write and edit articles for the site. Another example is
Yahoo! Answers, where users submit and answer questions. In
Web 2.0 sites, most of the contents are created by individual users, 
not service providers. Crowdsourcing is the driving force of these 
web sites. To facilitate the development of crowdsourcing appli- 
cations, Amazon provides the Mechanical Turk (AMT)^ platform. 
Computer programmers can exploit AMT's API to publish jobs for 
human workers, who are good at some complex jobs, such as im- 
age tagging and natural language processing. The collective intel- 
ligence helps solve many computationally difficult tasks, thereby 
improving the quality of output and users' experience. Figure 1 il- 
lustrates the idea of using crowdsourcing techniques to divide up 
jobs. CrowdDB [6], HumanGS [19] and CrowdSearch [23] are re- 
cent examples of applications on Amazon's AMT crowdsourcing 
platform. 

Crowdsourcing relies on human workers to complete a job, but 
humans are prone to errors, which can make the results of crowd- 
sourcing arbitrarily bad. The reason is two-fold. First, to obtain re- 
wards, a malicious worker can submit random answers to all ques- 
tions. This can significantly degrade the quality of the results. Sec- 
ond, for a complex job, the worker may lack the required knowl- 
edge for handling it. As a result, an incorrect answer may be pro- 
vided. To address the above problems, in AMT, a job is split into 
many HITs (Human Intelligence Tasks) and each HIT is assigned 
to multiple workers so that replicated answers are obtained. If con- 
flicting answers are observed, the system will compare the answers 
of different workers and determine the correct one. For example, in 
the CrowdDB [6], the voting strategy is adopted. 

The replication strategy, however, does not fully solve the answer 
diversity problem. Suppose we want the precision of our image tags 
to be 95% and the cost of worker per HIT is $0.01. If we assign 
each HIT to too many workers, we will have to pay a high cost. 
On the other hand, if few workers provide tags, we will not have 
enough clue to infer the correct tags. Given an expected accuracy, 
we therefore need an adaptive query engine that guarantees high 
accuracy with high probability and incurs as little cost as possible. 



^ https://www.mturk.com/mturk/welcome 
In this paper, we propose a quality-sensitive answering model
for the crowdsourcing systems, which is designed to significantly 
improve the quality of query results and effectively reduce the pro- 
cessing cost at the same time. This model is the core of our pro- 
posed Crowdsourcing Data Analytics System (CDAS). CDAS ex- 
ploits the crowd intelligence to improve the performance of differ- 
ent data analytics jobs, such as image tagging and sentiment analy- 
sis. CDAS transforms the analytics jobs into human jobs and com- 
puter jobs, which are then processed by different modules. The hu- 
man jobs are handled by the crowdsourcing engine, which adopts 
a two-phase processing strategy. The quality-sensitive answering 
model is correspondingly split into two sub-models, a prediction 
model and a verification model. The sub-models are applied to dif- 
ferent phases, respectively. 

In the first phase, the engine employs the prediction model to 
estimate how many workers are required to achieve a specific accu- 
racy. The model generates its estimation by collecting the distribu- 
tion of all workers' historical performances. Based on the model's 
result, the engine creates and submits the HIT to the crowdsourc- 
ing platform. In the second phase, the engine obtains the answers 
from the human workers and refines them as different workers may 
return different results for the same question. To verify the an- 
swers from different human workers, the voting strategy is used 
in CrowdDB to select the correct one. In the simplest case, each 
HIT is sent to n workers (n is odd). A result is assumed to be
"correct" and accepted if no fewer than $\lceil n/2 \rceil$ workers return it. The
voting strategy is simple, but is not very effective in the crowd- 
sourcing scenario. Suppose we have a set of product reviews and 
want to know the opinion of each review. We set the score to ei- 
ther "positive", "negative" or "neutral". If 30% of the workers vote 
"positive", 30% of the workers vote "negative" and the remaining 
workers vote "neutral", the voting strategy cannot decide which 
answer is more trustable. Moreover, even if more than 50% of the 
workers vote "negative", we cannot accept the answer directly - 
some malicious workers may collude to produce a false answer. To 
improve the accuracy of the crowdsourcing results, CDAS adopts a 
probabilistic approach. 
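The failure case described above can be reproduced with a minimal sketch of half-voting. The function name `half_vote` and the ten-worker split are illustrative assumptions (an even n is used only so that the 30%/30%/40% split is exact); this is not CrowdDB's implementation.

```python
from collections import Counter
from math import ceil


def half_vote(answers):
    """Return the answer given by at least ceil(n/2) workers, or None."""
    n = len(answers)
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= ceil(n / 2) else None


# 30% positive, 30% negative, 40% neutral: no answer reaches the threshold.
split = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 4
print(half_vote(split))

# A clear majority is accepted, even though colluding workers could produce it.
print(half_vote(["negative"] * 6 + ["positive"] * 4))
```

The first call yields no decision at all, and the second accepts whatever the majority says regardless of how reliable those workers are, which is the gap the probabilistic approach is meant to close.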

First, a verification model is employed to replace the voting strat- 
egy. It relies on workers' past performances (i.e., the workers' ac- 
curacies for historical queries) and combines vote distribution and 
workers' performances. Intuitively, the system is more likely to ac- 
cept the answers provided by the worker with a good accuracy. A 
random sampling approach is designed to estimate the workers' ac- 
curacies in each job. By applying the probability-based verification 
model, we can significantly improve the result quality. 

Second, instead of waiting for all the results, the adaptive query 
engine provides an approximate result with confidence and refines 
it gradually as more answers are returned. This technique has been 
designed based on our observation that in AMT, workers finish their 
jobs asynchronously. Therefore, it is important to offer the option 
of an approximate answer that is gradually improved as more re- 
sults are available, instead of letting the user wait for the comple- 
tion of the query. This strategy is similar to the traditional online 
query processing in philosophy and serves to improve users' expe- 
rience. 

To evaluate our model and the performance of CDAS, we im- 
plement two practical crowdsourcing jobs, a twitter sentiment an- 
alytics (TSA) job and an image tagging (IT) job. In TSA job, we 
submit a set of movie titles as our queries and try to find the opin- 
ions of Twitter users. In IT job, we use the images of Flickr as the 
queries and ask the human workers to choose the correct tags. We 
will show the effectiveness of our crowdsourcing engine based on 
the quality-sensitive answering model in the experimental section. 




Figure 2: CDAS Architecture 

The remainder of the paper is organized as follows. In Section 
2, we present the architecture of CDAS, and introduce the appli- 
cations implemented over CDAS. In Section 3, we introduce our 
prediction model for estimating a proper number of workers for 
each job. To improve the result accuracy, a probability-based ver- 
ification model is proposed in Section 4, which can be extended 
to support online processing. We evaluate the performance of our 
models in CDAS in Section 5, and discuss some related work in 
Section 6. We conclude the paper in Section 7. 



2. OVERVIEW 

In this section, we introduce the architecture of our Crowdsourc- 
ing Data Analytics System, CDAS, and discuss how to implement 
applications on top of CDAS. 

2.1 Architecture of CDAS 

CDAS is the system that exploits the crowdsourcing techniques 
to improve the performance of data analytics jobs. The core differ- 
ence between CDAS and the conventional analytics systems lies in 
the processing mechanism. CDAS employs human workers to as- 
sist the analytics tasks, while other systems rely solely on computer 
systems to answer the queries. Figure 2 shows the architecture of 
CDAS. CDAS consists of three major components: job manager, 
crowdsourcing engine and program executor. The job manager 
accepts the submitted analytics jobs and transforms them into a 
processing plan, which describes how the other two components 
(crowdsourcing engine and program executor) should collaborate 
for the job. In particular, the job manager partitions the job into two 
parts, one for the computers and one for the human workers. For 
example, in human-assisted image search, the human workers are 
responsible for providing the tags for each image, while the image 
classification and index construction are handled by the computer 
programs. In most cases, the two parts interact with each other dur- 
ing processing. The program executor summarizes the results of 
crowdsourcing engine, and the engine may change its job schedule 
due to the requests of program executor. 

The crowdsourcing engine processes human jobs in two phases. 

1 . In the first phase, the engine generates a query template for 
the specific type of human jobs. The query template follows 
the format of the crowdsourcing platform, such as AMT, and 
should be easily understood by human workers. The en- 
gine then translates each job from the job manager into a set 
of crowdsourcing tasks and publishes them into the crowd- 
sourcing platform. To reduce the crowdsourcing cost, the en- 
gine employs a prediction model, which estimates the num- 
ber of required human workers for a specific task based on 
the distribution of workers' performance. 

Figure 3: Query Template (a "Tweet Review Identification" form that
shows the query tweet, asks "1. The tweet shows the following
emotion:" with the options Best Ever, Good and Not Satisfied, asks
"2. The reasons of the above score:" with a free-text field, and ends
with a Submit button)

2. In the second phase, the human workers' answers are returned
to the crowdsourcing engine, which combines the results and
removes the ambiguity. A verification model is developed to
select the correct answer based on the probability estimation.

Sometimes, the human tasks need to disclose some sensitive data 
to the public. We design a privacy manager inside the engine to 
address the problem. The privacy manager may adaptively change 
the formats of the generated questions for human workers. It may 
also reject some workers for a specific task. 

The performance of crowdsourcing engine is determined by the 
two models, the prediction model and verification model. We shall 
introduce the two models in the following sections and discuss the 
implementation of two practical applications, a twitter sentiment 
analytics (TSA) job and an image tagging (IT) job, to validate the 
performance of our models. 

2.2 Deploying Applications on CDAS 

In this section, we use the TSA job as a running example to show 
how to deploy an application on CDAS. TSA job is typically pro- 
cessed using machine learning and information retrieval techniques 
[2] [22]. However, as shown in the experimental section, CDAS can 
achieve a much higher accuracy than some of these traditional ap- 
proaches for the TSA job. 

In the TSA job, the query is formally defined as follows. 

Definition 1. Query in TSA 

The query in TSA follows the format of (S, C, R, t, w), where S
is a set of keywords, C denotes the required accuracy, R is the
domain of answers, t is the timestamp of the query and w is the
time window of the query.

For example, suppose the user wants to know the public opinions
on iPhone4S from Oct-14-2011 to Oct-23-2011. The corresponding
query can be expressed as: Q=({iPhone4S, iPhone 4S}, 95%,
{Best Ever, Good, Not Satisfied}, Oct-14-2011, 10). The answer
to the query consists of two parts. The first part is the percentage 
of each opinion and the second part comprises the reasons. For 
the above query, one possible answer is that most people perceive 
iPhone4S is a good product thanks to the features of Siri and iOS 
5, while a smaller but significant number of people are not satisfied 
with its display and battery performance. 
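As a rough illustration of Definition 1, the query tuple (S, C, R, t, w) could be held in a small record type. The class name `TSAQuery`, its field names, the choice of days as the unit of w, and the `matches` helper are all assumptions made for this sketch; the paper does not specify a concrete representation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TSAQuery:
    keywords: frozenset   # S: set of keywords
    accuracy: float       # C: required accuracy in (0, 1)
    answers: tuple        # R: domain of answers
    start: date           # t: timestamp of the query
    window_days: int      # w: time window (unit assumed to be days)


def matches(query, tweet_text):
    """Hypothetical keyword filter: True if any keyword occurs in the tweet."""
    text = tweet_text.lower()
    return any(keyword.lower() in text for keyword in query.keywords)


# The example query from the text.
q = TSAQuery(
    keywords=frozenset({"iPhone4S", "iPhone 4S"}),
    accuracy=0.95,
    answers=("Best Ever", "Good", "Not Satisfied"),
    start=date(2011, 10, 14),
    window_days=10,
)
```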



Table 1: Users' Opinion on iPhone4S

Opinions       Percentages  Reasons
Best Ever      60%          Siri, iOS 5, Performance
Good           10%          Siri, 1080P
Not Satisfied  30%          iPhone4, Display, Battery



The query definition of TSA is registered in the job manager, 
which then generates the corresponding processing plan. The program
executor is responsible for retrieving the Twitter stream and
checking whether the query keyword (S = iPhone4S in the above
example) exists in a tweet. The candidate tweets are fed to the
crowdsourcing engine, which will generate a query template as shown in
Figure 3. 

When the crowdsourcing engine collects enough tweets in its 
buffer, it starts to generate the HIT (Human Intelligence Task). In 
particular, it creates an HTML section (bounded by <div> and 
</div>) for each tweet using the query's template. For all the 
tweets in the buffer, we concatenate their HTML sections to form 
our HIT description. Therefore, one HIT in the TSA job contains 
questions for multiple tweets about the same product, movie, per- 
son or event. 
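The HIT construction described above (one `<div>` section per buffered tweet, concatenated into a single description) might look as follows. This is a sketch only: the template string and the function names `render_section` and `build_hit` are hypothetical, and nothing here calls the AMT API.

```python
from html import escape

# Hypothetical query template with a single placeholder for the tweet text.
TEMPLATE = (
    "<p>Query Tweet: {tweet}</p>"
    "<p>1. The tweet shows the following emotion: "
    "Best Ever / Good / Not Satisfied</p>"
    "<p>2. The reasons of the above score:</p>"
)


def render_section(template, tweet):
    """Wrap one tweet's questions in a <div> ... </div> section."""
    return "<div>" + template.format(tweet=escape(tweet)) + "</div>"


def build_hit(template, buffer):
    """Concatenate the sections of all buffered tweets into one HIT description."""
    return "".join(render_section(template, tweet) for tweet in buffer)


hit_html = build_hit(TEMPLATE, ["Siri is great <3", "Battery drains fast"])
```

Escaping the tweet text before substitution keeps user-generated content from breaking the markup of the generated section.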

The HIT is then published into the AMT for processing. Algo- 
rithm 1 summarizes the two-phase query processing in the crowd- 
sourcing engine (note that Algorithm 1 describes the general query 
processing strategy, not just for the TSA job). In the preprocessing, 
the engine generates a HIT job for the tweets using the query tem- 
plate (line 1-6). In the first phase, it applies the prediction model 
to estimate the number of workers required to satisfy the prede- 
fined accuracy (line 7). In the second phase, it submits the HIT to 
AMT and waits for the answers (line 8-10). The verification model 
is used to select the correct answers. In line 7, Q.C denotes the 
accuracy requirement specified by query Q. 



Algorithm 1 queryProcessing(ArrayList<Tweet> buffer, Query Q)

1: HtmlDesc H = new HtmlDesc()
2: for i = 0 to buffer.size()-1 do
3:   Tweet t = buffer.get(i)
4:   HtmlSection hs = new HtmlSection(Q.template(), t)
5:   H.concatenate(hs)
6: HIT task = new HIT(H)
7: int n = predictWorkerNumber(Q.C)
8: submit(task, n)
9: while not all answers received do
10:   verifyAnswer()



In Algorithm 1, the two models direct the whole procedure of
query processing. They are the focus of this paper and are
presented in the following sections.

3. PREDICTION MODEL 
3.1 Economic Model in AMT 

The prediction model is designed to ensure high-quality answers 
and to reduce cost. It is highly related to how the crowdsourcing 
platform charges the requesters. Therefore, we first briefly intro- 
duce the economic model of AMT. 

In AMT, a HIT is published and broadcasted to all candidate 
workers. Any candidate worker can accept the task. Thus, if n an- 
swers for a HIT are required, from the point of view of CDAS, there 
will be n random workers providing the answers. AMT charges 
CDAS for each HIT using the following rules: 

1. Every worker is paid a fixed amount of money $m_c$.

2. CDAS pays a fixed amount of money $m_s$ per worker to the
AMT system for each HIT.

Therefore, we spend $(m_c + m_s)n$ for each HIT. Take query
$Q = (S, C, R, t, w)$ in TSA as an example: if we get $K$ available
tweets for each time unit, the cost of processing $Q$ is
$(m_c + m_s)nKw$. In our prediction model, the number of workers is
correlated to the required accuracy $C$. We use function $g$ to denote
the relationship between $C$ and $n$. Consequently, the query cost can
be represented as $(m_c + m_s)wK \times g(C)$. Before we present the
technical details, we summarize the notations used in the paper in Table 2.
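The cost expression (m_c + m_s) n K w is simple enough to check with a few lines of code. The function name `query_cost` and the sample values below are assumptions for illustration; only the $0.01 worker payment echoes the figure quoted in the introduction.

```python
def query_cost(m_c, m_s, n, K, w):
    """Cost of a TSA query: (m_c + m_s) * n * K * w.

    m_c: payment to each worker, m_s: fee paid to the platform per worker,
    n: workers per HIT, K: tweets per time unit, w: time window.
    """
    return (m_c + m_s) * n * K * w


# Hypothetical values: $0.01 per worker, $0.001 platform fee,
# 5 workers, 100 tweets per time unit, a window of 10 time units.
cost = query_cost(0.01, 0.001, 5, 100, 10)
```

Because the cost grows linearly in n, every worker that the prediction model can safely drop reduces the bill by (m_c + m_s) K w.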

Table 2: Table of Notations

Symbol        Description
$U$           the set of workers
$u_i$         the i-th worker
$n$           the number of workers
$P_{n/2}$     the probability that at least $\lceil n/2 \rceil$ workers provide the correct answer
$A$           the set of accuracy of workers
$a_i$         the accuracy of worker $u_i$
$\mu$         the mean value of worker accuracy
$f(u_i)$      the answer provided by worker $u_i$
$Q$           the observation of distribution of answers
$P(r|Q)$      the probability of answer $r$ being correct under the observation $Q$
$m$           the number of all possible answers
$c_i$         the confidence of worker $u_i$
$\rho(r_i)$   the confidence of answer $r_i$



3.2 Voting-based Prediction 

Given n (n is odd) answers from workers $U = \{u_1, u_2, \ldots, u_n\}$,
the voting strategy accepts an answer if at least $\lceil n/2 \rceil$ workers
return the same answer. While the voting strategy guarantees that no
other answer receives more votes, it does not address the problem of
how to select n.

To address the above problem, we propose a voting-based pre- 
diction model. Given an accuracy requirement, the prediction model 
estimates the number of workers required. That is, the goal of the 
prediction model is to derive the function g for each query. We 
prove in Section 4 that the model can also produce a bound for our 
probability-based verification approach. 

3.2.1 A Conservative Estimation 

We compute the probability that at least $\lceil n/2 \rceil$ workers provide
the correct answer. We use $P_{n/2}$ to denote the probability. Suppose
the accuracy of all n workers are $A = \{a_1, a_2, \ldots, a_n\}$, where
the accuracy means the probability of a worker providing a correct
answer. By the definition of $P_{n/2}$, we have the following equation:

$$P_{n/2} = \sum_{U' \subseteq U,\ |U'| \ge \lceil n/2 \rceil} \Big( \prod_{u_i \in U'} a_i \prod_{u_j \in U \setminus U'} (1 - a_j) \Big)$$

$U'$ denotes a subset of user set $U$ with size no smaller than
$\lceil n/2 \rceil$. The above equation enumerates all the possible cases
that the correct answer can be obtained by voting.
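The enumeration over worker subsets can be written out directly for small n. The sketch below is a brute-force illustration (exponential in n) with an assumed function name, `prob_majority_correct`; it is meant to make the definition concrete, not to be used at scale.

```python
from itertools import combinations
from math import ceil, prod


def prob_majority_correct(accuracies):
    """Probability that at least ceil(n/2) independent workers are correct.

    Enumerates every subset of workers of size >= ceil(n/2) that could be
    exactly the set of correct workers, and sums the probability of each case.
    """
    n = len(accuracies)
    need = ceil(n / 2)
    total = 0.0
    for size in range(need, n + 1):
        for subset in combinations(range(n), size):
            correct = set(subset)
            total += prod(
                a if i in correct else 1 - a
                for i, a in enumerate(accuracies)
            )
    return total


p_fair = prob_majority_correct([0.5, 0.5, 0.5])
```

With three workers who each guess correctly half the time, the four favourable cases (three pairs and the full set) each have probability 1/8, so `p_fair` is 0.5.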

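The workers of a HIT can be considered as random workers from
AMT. Let $\mu$ denote the mean value of the workers' accuracy. We
have the following theorem to compute the expectation of the
probability that at least $\lceil n/2 \rceil$ workers return the correct answer:

Theorem 1. If workers answer the queries independently,

$$E[P_{n/2}] = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} \mu^k (1-\mu)^{n-k}$$

Proof. As all workers are randomly picked, $a_i$ and $a_j$ are
independent for any $i \neq j$. Similarly, $a_i$ and $1 - a_j$ are also
independent. Thus,

$$E[P_{n/2}] = E\Big[ \sum_{k=\lceil n/2 \rceil}^{n} \sum_{U' \subseteq U,\ |U'| = k} \Big( \prod_{u_i \in U'} a_i \prod_{u_j \in U \setminus U'} (1 - a_j) \Big) \Big]$$

$$= \sum_{k=\lceil n/2 \rceil}^{n} \sum_{U' \subseteq U,\ |U'| = k} \Big( \prod_{u_i \in U'} E[a_i] \prod_{u_j \in U \setminus U'} E[1 - a_j] \Big)$$

We have $E[a_i] = \mu$ and $E[1 - a_i] = 1 - \mu$. Therefore, $E[P_{n/2}]$
can be computed as:

$$E[P_{n/2}] = \sum_{k=\lceil n/2 \rceil}^{n} \sum_{U' \subseteq U,\ |U'| = k} \Big( \prod_{u_i \in U'} \mu \prod_{u_j \in U \setminus U'} (1 - \mu) \Big)$$

$$= \sum_{k=\lceil n/2 \rceil}^{n} \sum_{U' \subseteq U,\ |U'| = k} \mu^k (1-\mu)^{n-k}$$

$$= \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} \mu^k (1-\mu)^{n-k}$$

□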



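For a given query, we require $E[P_{n/2}]$ to be no less than a given
accuracy $C$, i.e., $E[P_{n/2}] \ge C$. Furthermore, we derive a lower
bound of $E[P_{n/2}]$ that can be easily computed as follows.

Theorem 2. $E[P_{n/2}] \ge 1 - e^{-2n(\mu - \frac{1}{2})^2}$

Proof. By the Chernoff bound,

$$\sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} \mu^k (1-\mu)^{n-k} \ge 1 - e^{-2n(\mu - \frac{1}{2})^2}$$

Moreover, for any odd n, we have

$$\lceil n/2 \rceil = \lfloor n/2 \rfloor + 1$$

Therefore,

$$E[P_{n/2}] = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} \mu^k (1-\mu)^{n-k} = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} \mu^k (1-\mu)^{n-k} \ge 1 - e^{-2n(\mu - \frac{1}{2})^2}$$

□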



By requiring $1 - e^{-2n(\mu - \frac{1}{2})^2} \ge C$, we guarantee that
$E[P_{n/2}] \ge C$ (i.e., the expected accuracy of the query result is
no less than $C$). Consequently, we obtain a sufficient condition for
the quality of the crowdsourcing query engine:

Theorem 3. Given required accuracy $C$ and the mean value
of workers' accuracy $\mu$, choosing

$$n \ge \frac{-\ln(1-C)}{2(\mu - \frac{1}{2})^2}$$

workers ensures the expected accuracy of the crowdsourcing result is
no less than $C$.

Note that n is an odd integer, so the minimum value of n is

$$2\Big\lfloor \frac{-\ln(1-C)}{4(\mu - \frac{1}{2})^2} \Big\rfloor + 1$$
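As a quick numeric check of the conservative estimate, the sketch below evaluates the odd worker count 2⌊−ln(1−C) / (4(μ−1/2)²)⌋ + 1 for given C and μ. The function name `chernoff_worker_bound` is an assumption, and the guard requiring μ > 1/2 is added here because the expression is not meaningful otherwise.

```python
from math import floor, log


def chernoff_worker_bound(C, mu):
    """Odd worker count 2*floor(-ln(1-C) / (4*(mu-1/2)^2)) + 1."""
    if not 0 < C < 1:
        raise ValueError("required accuracy C must lie in (0, 1)")
    if mu <= 0.5:
        raise ValueError("mean worker accuracy mu must exceed 1/2")
    return 2 * floor(-log(1 - C) / (4 * (mu - 0.5) ** 2)) + 1


# Required accuracy 95% with workers who are right 75% of the time on average.
n_conservative = chernoff_worker_bound(0.95, 0.75)
```

For C = 0.95 and μ = 0.75 this gives 23 workers per HIT, and the count grows quickly as μ approaches 1/2, which is why a tighter estimate is worth searching for.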
3.2.2 Optimization with Binary Search

Setting n to $2\lfloor \frac{-\ln(1-C)}{4(\mu - \frac{1}{2})^2} \rfloor + 1$
ensures the expected accuracy of results. However, it is well known
that the Chernoff bound provides a tight estimation only for a large
enough n. In some HITs, only a few workers participate in processing.
Therefore, Theorem 3 generates a conservative estimation that may
cause too many workers to be involved. To address this problem, we use
Theorem 3 as an upper bound and apply a binary search algorithm (on
odd numbers) to find a tighter estimation, i.e. the minimum odd n that
satisfies $E[P_{n/2}] \ge C$.

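Algorithm 2 binarySearch(double C)
// C is the required accuracy

1: int s = 1, int e = $2\lfloor \frac{-\ln(1-C)}{4(\mu - \frac{1}{2})^2} \rfloor + 1$
2: while s < e do
3:   int m = $2\lfloor \frac{s+e}{4} + \frac{1}{2} \rfloor - 1$
4:   double $E_m$ = computeExpectedProb(m)
5:   if $E_m \ge C$ then
6:     e = m
7:   else
8:     s = m + 2
9: return e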



Algorithm 3 computeExpectedProb(int x)

double E = 0, p = $\mu^x$
for int i = x to $\lceil x/2 \rceil$ do
  E = E + p
  p = p $\cdot \frac{i(1-\mu)}{\mu(x-i+1)}$
return E



Algorithm 2 shows the idea of binary search. We initialize the
domain of n to be $[1, 2\lfloor \frac{-\ln(1-C)}{4(\mu - \frac{1}{2})^2} \rfloor + 1]$
(line 1). At each step, we compute the expected accuracy of using m
workers (line 4), until we reach the minimum m that satisfies the
accuracy requirement. Algorithm 3 illustrates the process of computing
the expected accuracy. Its correctness is based on the fact that
$\binom{n}{k-1} / \binom{n}{k} = k/(n-k+1)$. Obviously, the time
complexity of Algorithm 3 is $O(n)$. Therefore, we can get a tighter
bound of the number of workers required using Algorithm 2 in
$O(n \log n)$ time.
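The binary search over odd worker counts can be sketched in a few lines. This is an illustrative reading of Algorithms 2 and 3, not the authors' code: the names `expected_prob` and `min_workers` are assumptions, μ is passed explicitly rather than read from system statistics, and the sketch assumes 1/2 < μ < 1 so that the expected accuracy increases with the (odd) number of workers.

```python
from math import floor, log


def expected_prob(x, mu):
    """Sum over k = ceil(x/2)..x of C(x,k) * mu^k * (1-mu)^(x-k), in O(x)."""
    term = mu ** x                      # the k = x term
    total = 0.0
    for k in range(x, (x + 1) // 2 - 1, -1):
        total += term
        # Step from the k-th term to the (k-1)-th term using
        # C(x, k-1) / C(x, k) = k / (x - k + 1).
        term *= k * (1 - mu) / (mu * (x - k + 1))
    return total


def min_workers(C, mu):
    """Smallest odd n with expected_prob(n, mu) >= C, found by binary search."""
    if not 0 < C < 1 or not 0.5 < mu < 1:
        raise ValueError("need 0 < C < 1 and 1/2 < mu < 1")
    s = 1
    e = 2 * floor(-log(1 - C) / (4 * (mu - 0.5) ** 2)) + 1   # upper bound
    while s < e:
        m = 2 * ((s + e + 2) // 4) - 1   # odd midpoint, 2*floor((s+e)/4 + 1/2) - 1
        if expected_prob(m, mu) >= C:
            e = m
        else:
            s = m + 2
    return e


n_tight = min_workers(0.95, 0.75)
```

For C = 0.95 and μ = 0.75 the search settles on 9 workers, compared with 23 from the closed-form bound, since nine workers already reach an expected accuracy of about 0.951 while seven reach only about 0.929.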

3.3 Sampling-based Accuracy Estimation 

In the previous two prediction models, we rely on the statistics 
of workers' accuracy distribution. However, not all crowdsourcing 
platforms provide such information due to the privacy issue. Even 
if some platforms provide certain statistics, they cannot be directly 
used as workers' accuracy. For example, AMT system records the 
approval rate of each worker. Approval rate shows the percentage 
of answers approved by the requester. However, we have observed 
that the approval rate is not consistent with the accuracy of the 
worker in CDAS. There are two main reasons. First, the worker's 
accuracy may vary widely across jobs. Second, some requesters 
set automatic approval for all answers without verification. The 
difference of approval rate and accuracy is studied through experi- 
ments. To resolve the above problem, we design a sampling-based 
approach. Specifically, for a registered query, we randomly embed 
m questions, whose ground truth are known beforehand. These 
questions are used as our testing samples to estimate the workers' 
accuracy. 

Here we use the TSA application to illustrate the sampling method.
As mentioned previously, each HIT contains the questions of $B$
tweets. To get unbiased results, we randomly inject $\alpha B$ samples
into a HIT. In other words, each HIT has $\alpha B$ testing samples and
$(1 - \alpha)B$ new tweets. In our current implementation, $\alpha$ and
$B$ are set to 0.2 and 100, respectively. We evaluate the effect of
sampling rate $\alpha$ in our experiments, and the results confirm that
even a low sampling rate can produce an acceptable estimation.

Algorithm 4 sampling(HIT H)

WorkerSet U = H.getWorkers()
Double[] rate = new Double[U.size]
while H.nextQuestion() ≠ null do
  Question q = H.getNextQuestion()
  if q is a testing sample then
    for i = 0 to U.size-1 do
      Worker u = U.get(i)
      if u.getAnswer(q) == q.groundTruth then
        rate[i] = rate[i] + $\frac{1}{\alpha B}$

In the sampling process, CDAS collects the accuracy of partici- 
pating workers. Algorithm 4 shows the procedure. After the sam- 
pling, the statistics are used in both the prediction model and the 
verification model. 
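The sampling step amounts to scoring each worker on the embedded questions whose ground truth is known. The sketch below assumes a simple in-memory layout (a list of question/ground-truth pairs and a dictionary of per-worker answers) and a hypothetical function name, `estimate_accuracies`; real HIT results would first have to be parsed into this shape.

```python
def estimate_accuracies(questions, answers):
    """Estimate each worker's accuracy from the testing samples in one HIT.

    questions: list of (question_id, ground_truth), with ground_truth set to
               None for ordinary questions that have no known answer.
    answers:   {worker_id: {question_id: answer}}.
    """
    samples = [(qid, truth) for qid, truth in questions if truth is not None]
    if not samples:
        raise ValueError("the HIT contains no testing samples")
    rates = {}
    for worker, given in answers.items():
        correct = sum(1 for qid, truth in samples if given.get(qid) == truth)
        rates[worker] = correct / len(samples)
    return rates


# Two testing samples (q1, q3) mixed with one new tweet (q2).
questions = [("q1", "pos"), ("q2", None), ("q3", "neg")]
answers = {
    "u1": {"q1": "pos", "q2": "neu", "q3": "neg"},
    "u2": {"q1": "neg", "q2": "pos", "q3": "neg"},
}
rates = estimate_accuracies(questions, answers)
```

Answers to the ordinary question q2 are ignored, so the estimate depends only on the samples: here u1 scores 1.0 and u2 scores 0.5.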

4. VERIFICATION MODEL 

In the voting-based verification, if more than half of the workers
return the same answer, the query engine will accept it as the
correct answer. Although our prediction model tries to guarantee
that at least half of the workers submit the correct answer, the
voting-based verification occasionally fails to provide an
answer.

For a specific question, different workers may provide different 
answers, and in some cases, no answer gets an agreement above 
50%. Moreover, the voting strategy assumes that all the workers 
provide the correct answer with the same probability, which is not 
true as the accuracy of different workers varies a lot and the work- 
ers with higher accuracy are more trustable. In this section, we 
propose a probability-based verification method to determine the 
best answer. 

4.1 Probability-based Verification 

Probability-based verification tries to evaluate the quality of an- 
swers through workers' historical performances (i.e. accuracy). 
In particular, given the probability distribution of workers' perfor- 
mances, we apply the Bayesian theorem to estimate the accuracy 
of each result. We adopt and extend the approach proposed in the 
data fusion [4] for integrating conflicting results in the CDAS. 

Suppose a HIT is answered by n workers $\{u_1, u_2, \ldots, u_n\}$ with
accuracy $\{a_1, a_2, \ldots, a_n\}$. We define function $f(u_i)$ to
represent the answer provided by worker $u_i$. Based on Bayesian
analysis, the probability of a specific answer $\bar{r} \in R$ being the
correct answer given the observation of the answer's distribution $Q$
(i.e. the answers provided by n workers) can be computed as:

$$P(\bar{r}|Q) = \frac{P(Q|\bar{r})P(\bar{r})}{P(Q)} = \frac{P(Q|\bar{r})P(\bar{r})}{\sum_{r_i \in R} P(Q|r_i)P(r_i)}$$

Suppose the size of the answer domain $|R| = m$. Without a priori
knowledge, each answer $r_i \in R$ appears with equal probability of
$\frac{1}{m}$. Then the above equation can be transformed into:

$$P(\bar{r}|Q) = \frac{P(Q|\bar{r})}{\sum_{r_i \in R} P(Q|r_i)} \qquad (1)$$
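Let $\bar{r}$ be the correct answer. The probability for worker $u_j$
providing the correct answer is $a_j$ (i.e. accuracy). Without any a
priori knowledge, each incorrect answer provided by $u_j$ appears with
equal probability $\frac{1 - a_j}{m - 1}$. Therefore, $P(Q|\bar{r})$ can
be computed as:

$$P(Q|\bar{r}) = \prod_{f(u_j) = \bar{r}} a_j \prod_{f(u_j) \neq \bar{r}} \frac{1 - a_j}{m - 1} \qquad (2)$$

Combining Equation 1 and 2, and dividing both the numerator and the
denominator by $\prod_{j} \frac{1 - a_j}{m - 1}$, we have

$$P(\bar{r}|Q) = \frac{\prod_{f(u_j) = \bar{r}} \frac{(m-1)a_j}{1 - a_j}}{\sum_{r_i \in R} \prod_{f(u_j) = r_i} \frac{(m-1)a_j}{1 - a_j}} \qquad (3)$$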



For ease of illustration, we define the Worker Confidence for an
answer as follows.

Definition 2. Worker Confidence

Let $a_j$ be the accuracy of worker $u_j$. The confidence $c_j$ of worker
$u_j$ is defined as:

$$c_j = \ln\frac{(m-1)a_j}{1 - a_j} = \ln(m-1) + \ln\frac{a_j}{1 - a_j}$$

From the above definition, we can see that high-accuracy workers
will get large confidence values. This is consistent with the
intuition that workers with higher accuracy are more trustable.

Based on the definition of worker confidence and Equation 3,
we define the Answer Confidence as below.

Definition 3. Answer Confidence

The confidence of an answer $\bar{r}$ equals the probability of $\bar{r}$
being the correct answer:

$$\rho(\bar{r}) = P(\bar{r}|Q) = \frac{e^{\sum_{f(u_j) = \bar{r}} c_j}}{\sum_{r_i \in R} e^{\sum_{f(u_j) = r_i} c_j}} \qquad (4)$$

In our CDAS, the answer with the highest confidence is accepted
as the final result. In fact, the confidence of an answer represents
a variant of voting, where $e^{c_j}$ is used as the weight for worker
$u_j$. Apparently, the worker with a higher confidence gets more
weight. To speed up the computation of $P(\bar{r}|Q)$, we cache the
value $\ln\frac{a_j}{1 - a_j}$ for each known worker.

We can prove that using Theorem 1 to estimate the number of
workers required also produces a quality bound for our
probability-based verification approach.

Theorem 4. If $E[P_{n/2}] \ge C$ and $\bar{r}$ is the correct answer,
our probability-based verification model returns $\bar{r}$ as
the result with a probability no less than $C$.

Proof. Based on Theorem 1,

$$E[P_{n/2}] = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} \mu^k (1-\mu)^{n-k} \ge C$$

Namely, the expected number of workers who provide the correct
answer is larger than $\frac{n}{2}$ with a probability larger than $C$.
The confidences of all workers are independent and identically
distributed (i.i.d.), because the accuracies of the workers are i.i.d.
Let $E_c$ denote the mean value of workers' confidences. As a result,
the total number of expected votes for answer $\bar{r}$ is

$$E\Big[\sum_{f(u_j) = \bar{r}} c_j\Big] = E_c \cdot |\{u_j \mid f(u_j) = \bar{r}\}| > \frac{n E_c}{2}$$

Note that in Equation 4, all answers share the same denominator.
The value of $P(\bar{r}|Q)$ is proportional to
$e^{\sum_{f(u_j) = \bar{r}} c_j}$. Thus, $\bar{r}$ is the answer with the
largest expected confidence and is returned as the result in
expectation. Otherwise, if another answer $r'$ has a larger expected
probability than $\bar{r}$, i.e.,

$$E[P(r'|Q)] > E[P(\bar{r}|Q)]$$

We will have

$$E\Big[\sum_{f(u_j) = r'} c_j\Big] > E\Big[\sum_{f(u_j) = \bar{r}} c_j\Big] > \frac{n E_c}{2}$$

Therefore,

$$E\Big[\sum_{f(u_j) = r'} c_j + \sum_{f(u_j) = \bar{r}} c_j\Big] > n E_c$$

In fact, the sum of workers' confidences is equal to the sum of
confidences for every answer:

$$n E_c = E\Big[\sum_{r_i \in R} \Big(\sum_{f(u_j) = r_i} c_j\Big)\Big]$$

This results in a contradiction that the sum of confidences of
$\bar{r}$ and $r'$ exceeds the sum of all confidences. Therefore, our
probability-based verification model returns $\bar{r}$ as the result
with a probability no less than $C$. □



The only unknown parameter in Equation 4 is m, the size of R.
We can simply set $m = |R|$. However, in our experimental study,
we have found that not all answers in R are picked by the workers.
For example, if a question asks a worker to rank a product based on
some tweets and the score ranges from 0 to 100, the scores will
follow a very skewed distribution. Some low-probability answers are
never selected, but they do reduce the weight of a correct answer.
Thus, we need to select a good m to prune the noise.

After a HIT completes, the crowdsourcing engine gets k distinct
answers for a specific question from n workers (k < n). In this
observation, we select k distinct answers among m possible ones.

The probability of this selection can be computed as
$\binom{m}{k} / m^k$. Suppose this is not a very rare observation and
the probability of this observation is larger than $\epsilon$ (e.g., we
prune the low-probability noise). The following lemma provides a
lower bound for m.



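Lemma 1. $m > \frac{k-1}{H_{k-1} - (k-1)(k\epsilon)^{\frac{1}{k-1}}}$,
where $H_k = \sum_{i=1}^{k} \frac{1}{i}$ is the k-th Harmonic number.

Proof.

$$\epsilon < \frac{1}{m^k} \cdot \frac{m(m-1) \cdots (m-k+1)}{k(k-1) \cdots 1} = \frac{(1 - \frac{1}{m}) \cdots (1 - \frac{k-1}{m})}{k \cdot 1 \cdots (k-1)}$$

Derived from the above equation by the inequality of arithmetic and
geometric means, we have

$$(k\epsilon)^{\frac{1}{k-1}} < \frac{H_{k-1}}{k-1} - \frac{1}{m}$$

Therefore

$$m > \frac{k-1}{H_{k-1} - (k-1)(k\epsilon)^{\frac{1}{k-1}}}$$

□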

For a large k, the above lower bound is too loose. Instead, we 
propose a tighter lower bound for m: 



Lemma 2. m > 



k-1 



Proof. From Lemma 1, we have 



e < 



Therefore, 



Obviously, 



By setting k In ■ 



In e < In - 



■ ^ >k\n 

i k 

> In e, we get a tighter bound: 



m > 



□ 

Theorem 5. 

m > max{- 



k- 



k- 



1| 

ek J 



Hk-i - (k - l){ke)^ 1-^ 
Proof. Directly from Lemma 1 and 2. □ 

In our verification, we set $\epsilon$ to 0.05 based on Fisher's exact test
[5], which is widely adopted in practice. We then use Theorem 5 to
estimate the value of $m$.
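To make the bound concrete, the following Python sketch evaluates the Lemma 1 bound for $\epsilon = 0.05$. It assumes that the selection probability is $\binom{m}{k}/m^k$ and that the bound has the form $(k-1)/(H_{k-1} - (k-1)(k\epsilon)^{1/(k-1)})$; the function names are illustrative and are not part of CDAS.

```python
from math import comb


def selection_probability(m: int, k: int) -> float:
    """Assumed probability of observing k distinct answers among m possible ones."""
    return comb(m, k) / m ** k


def lemma1_lower_bound(k: int, eps: float):
    """Lemma 1 bound on m (assumed form); None when the denominator is not positive."""
    if k < 2:
        raise ValueError("the bound needs at least two distinct answers")
    harmonic = sum(1.0 / i for i in range(1, k))  # H_{k-1}
    denominator = harmonic - (k - 1) * (k * eps) ** (1.0 / (k - 1))
    if denominator <= 0:
        return None  # the bound gives no information for this k and eps
    return (k - 1) / denominator


bound = lemma1_lower_bound(k=3, eps=0.05)
print(round(bound, 3))  # 2.757
# Every m whose selection probability exceeds eps also exceeds the bound.
print(all(m > bound for m in range(1, 200) if selection_probability(m, 3) > 0.05))  # True
```

With $k = 3$ distinct answers the bound is about 2.76, i.e., at least three possible answers. The bound is a necessary condition only: $m = 3$ passes it although its selection probability (1/27) is still below 0.05.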

We now give an example in TSA to show the benefit of applying
our probability-based verification model. Table 3 shows the
example. Five workers with different accuracies provide three
different answers, namely Positive, Neutral and Negative. The results
of the three verification models are shown in Table 4. Both the
Half-Voting model and the Majority-Voting model choose Positive
as the result since three workers out of five provide the answer
Positive. However, our verification model can correctly choose
Negative as the result because the worker answering Negative has
a much higher accuracy. As a result, our verification model gets
more accurate answers than the other two voting-based models.

Table 3: An Example of Workers' Answers

Movie Title   Green Lantern
Tweet         Oh. My. GOD. "Green Lantern" movie is terrible.
              Like, "Lost In Space" movie terrible.

Worker ID     w1     w2     w3     w4     w5
Accuracy      0.54   0.31   0.49   0.73   0.46
Answer        pos    pos    neu    neg    pos

Table 4: Results of Verification Models

Model             pos     neu     neg     Answer
Half-Voting       3       1       1       pos
Majority-Voting   3       1       1       pos
Verification      0.329   0.176   0.495   neg
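The Verification row of Table 4 can be reproduced with a few lines of Python. The sketch below assumes that Equation 4 weights each worker by $(m-1)a/(1-a)$, multiplies the weights of workers who gave the same answer, and normalizes over all $m$ possible answers (unpicked answers keep weight 1). This form is an assumption on our part; it is consistent with the numbers in Table 4 for $m = 3$. The function name `verify` is illustrative.

```python
def verify(answers, m):
    """Return the confidence of each observed answer.

    answers: list of (worker_accuracy, answer) pairs, with 0 < accuracy < 1.
    m: assumed number of possible answers (the size of R).
    """
    weights = {}
    for accuracy, answer in answers:
        # Assumed worker weight: (m - 1) * a / (1 - a), multiplied per answer.
        odds = (m - 1) * accuracy / (1 - accuracy)
        weights[answer] = weights.get(answer, 1.0) * odds
    # Answers nobody picked are assumed to keep weight 1 in the normalizer.
    total = sum(weights.values()) + max(m - len(weights), 0)
    return {answer: w / total for answer, w in weights.items()}


table3 = [(0.54, "pos"), (0.31, "pos"), (0.49, "neu"), (0.73, "neg"), (0.46, "pos")]
scores = verify(table3, m=3)
print({answer: round(p, 3) for answer, p in scores.items()})
# {'pos': 0.329, 'neu': 0.176, 'neg': 0.495}
print(max(scores, key=scores.get))  # neg
```

The single accurate worker (0.73) outweighs three workers whose accuracies are close to or below 0.5, which is why Negative wins despite having only one vote.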

4.2 Online Processing 

The workers submit their answers asynchronously in AMT,
and CDAS has to wait for a sufficient number of answers to be
submitted. As a consequence, query response time in CDAS (and other
crowdsourcing systems for that matter) is expected to be longer
than that of non-crowdsourcing systems. To alleviate this problem
and to improve users' experience, we adopt online processing
techniques in CDAS. Instead of waiting for all workers to
complete their tasks, CDAS provides an approximate result based
on the answers received so far. As we have previously discussed,
uncertainty and approximation cannot be avoided in crowdsourcing
systems, which makes online processing a perfect fit for the query
processing in CDAS.

To resolve the uncertainty, we extend the techniques of data fu- 
sion [4] [14] to estimate the answer's confidence. However, the 
same approach cannot be directly applied to the online process- 
ing in CDAS, as in the crowdsourcing systems, the human workers 
compete for the tasks and CDAS does not have the profile (i.e. ac- 
curacy) for a specific user until he/she returns the answer. In our 
case, the accuracy of the answer provided by an unseen worker can 
only be estimated by the distribution of all workers' accuracies. 

4.2.1 Finding the Correct Answer Online

We apply Equation 4 to continuously update the probability of
each received answer. Suppose a HIT is assigned to $n$ workers and
the query engine has received answers from $n'$ ($n' < n$) workers.
Unlike in Equation 4, in this case we only have a partial observation
$\Omega'$ of the answer distribution. For the remaining $n - n'$ workers,
we have no idea what answers they may provide. Let $s$ denote
a possible answer set of the remaining workers, and let $S$
represent the set of all possible $s$.

Let $A = \{a_{n'+1}, a_{n'+2}, \ldots, a_n\}$ be the accuracies of the
remaining $n - n'$ workers. As we do not know the identities of the
remaining $n - n'$ workers, we consider all the possibilities. We use
$\mathbb{A}$ to represent all the possible permutations of $A$. The confidence
of an answer $r$ being the correct one can be estimated as the expected
probability $P(r \mid \Omega', s)$ over $S$ and $\mathbb{A}$, i.e.,

$$p(r) = E_{s \in S, A \in \mathbb{A}}[P(r \mid \Omega', s)]$$

The following theorem shows that Equation 4 can be applied to
compute $p(r)$.

Theorem 6. Assume that workers process the query independently
and the answers are submitted in a random order. Then
$p(r) = P(r \mid \Omega')$.

Proof. Based on the assumption, we have:

$$p(r) = E_{s \in S, A \in \mathbb{A}}[P(r \mid \Omega', s)] = E_{A \in \mathbb{A}}[E_{s \in S}[P(r \mid \Omega', s)]]$$

In fact, the answer set $s$ of the remaining workers does not affect
the computation of the above equation. As shown in [14],

$$E_{s \in S}[P(r \mid \Omega', s)] = P(r \mid \Omega')$$

The computation of $p(r)$ can be further simplified as:

$$p(r) = E_{A \in \mathbb{A}}[E_{s \in S}[P(r \mid \Omega', s)]] = E_{A \in \mathbb{A}}[P(r \mid \Omega')] = P(r \mid \Omega')$$

□

Theorem 6 shows that the confidence of a partial result can also
be computed by Equation 4. Therefore, we select the answer with
maximal confidence as our correct answer.

4.2.2 Early Termination 

When the current answers are good enough, we can terminate 
the HIT to reduce cost. The major challenge of early termination 
is how to measure the quality of the current results. Intuitively, we 
can stop accepting answers from new workers as soon as we are 
sure that the current result r will not change by the answers we 
choose to forgo. 

In particular, let $r_1$ and $r_2$ be the best and second best answers
based on their confidence, respectively. We have $P(r_1 \mid \Omega') >
P(r_2 \mid \Omega')$. Let $(u_1, u_2, \ldots, u_n)$ be the set of workers. Suppose $n'$
workers have submitted answers and $n - n'$ answers remain
unfilled. Assume an answer set $s = \{f(u_i) = r_2 \mid n'+1 \le i \le n\}$,
i.e., every remaining worker answers $r_2$.
Using techniques similar to those in [14], we can prove that the
minimal possible value of $P(r_1 \mid \Omega)$ and the maximal possible
value of $P(r_2 \mid \Omega)$ are:

$$\min P(r_1 \mid \Omega) = P(r_1 \mid \Omega', s) \qquad (5)$$

$$\max P(r_2 \mid \Omega) = P(r_2 \mid \Omega', s) \qquad (6)$$

Note that $\min P(r_1 \mid \Omega)$ and $\max P(r_2 \mid \Omega)$ are related to the
random variables $a_{n'+1}, a_{n'+2}, \ldots, a_n$. In our algorithm, we use
the expected values of $\min P(r_1 \mid \Omega)$ and $\max P(r_2 \mid \Omega)$, namely,
$E_{A \in \mathbb{A}}[\min P(r_1 \mid \Omega)]$ and $E_{A \in \mathbb{A}}[\max P(r_2 \mid \Omega)]$. However, it is
difficult to compute the expected values directly. Therefore, in
practice, we use approximate values of $E_{A \in \mathbb{A}}[\min P(r_1 \mid \Omega)]$ and
$E_{A \in \mathbb{A}}[\max P(r_2 \mid \Omega)]$: we assume every remaining worker has the
same accuracy $E[a_i]$ and use it in Equations 5 and 6. Empirical
results show that the approximations work well in practice.

We propose three different strategies as the termination condi- 
tion: 

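MinMax: $E_{A \in \mathbb{A}}[\min P(r_1 \mid \Omega)] > E_{A \in \mathbb{A}}[\max P(r_2 \mid \Omega)]$
MinExp: $E_{A \in \mathbb{A}}[\min P(r_1 \mid \Omega)] > P(r_2 \mid \Omega')$
ExpMax: $P(r_1 \mid \Omega') > E_{A \in \mathbb{A}}[\max P(r_2 \mid \Omega)]$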




Figure 4: Reviews for Kung Fu Panda 2



MinMax guarantees that the answer output by our system is stable 
when the termination condition is achieved. However, it is too con- 
servative. MinExp and ExpMax can terminate the processing much 
earlier, but may lead to low-quality results. We study the effect of 
the three strategies in our experiments. 
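The three termination conditions can be sketched in Python as follows. The sketch assumes the same form of Equation 4 as used for Table 4 (worker weight $(m-1)a/(1-a)$, unpicked answers with weight 1) and approximates the expectations by letting every remaining worker answer $r_2$ with the mean accuracy, as described above. The names `confidences` and `can_terminate` are illustrative, and returning False while only one distinct answer has been observed is a simplification of ours.

```python
def confidences(answers, m):
    """Assumed Equation 4: normalized product of (m - 1) * a / (1 - a) per answer."""
    weights = {}
    for accuracy, answer in answers:
        weights[answer] = weights.get(answer, 1.0) * (m - 1) * accuracy / (1 - accuracy)
    total = sum(weights.values()) + max(m - len(weights), 0)
    return {answer: w / total for answer, w in weights.items()}


def can_terminate(received, n_total, mean_accuracy, m, strategy):
    """Evaluate MinMax, MinExp or ExpMax on the answers received so far."""
    now = confidences(received, m)
    ranked = sorted(now, key=now.get, reverse=True)
    if len(ranked) < 2:
        return False  # simplification: wait until two candidates are observed
    r1, r2 = ranked[0], ranked[1]
    remaining = n_total - len(received)
    # Worst case for r1: all remaining workers answer r2 with the mean accuracy.
    worst = confidences(received + [(mean_accuracy, r2)] * remaining, m)
    min_p1, max_p2 = worst[r1], worst[r2]
    if strategy == "MinMax":
        return min_p1 > max_p2
    if strategy == "MinExp":
        return min_p1 > now[r2]
    if strategy == "ExpMax":
        return now[r1] > max_p2
    raise ValueError("unknown strategy: " + strategy)


received = [(0.8, "pos"), (0.8, "pos"), (0.6, "neg")]
for name in ("MinMax", "MinExp", "ExpMax"):
    print(name, can_terminate(received, n_total=5, mean_accuracy=0.7, m=3, strategy=name))
# MinMax False
# MinExp True
# ExpMax True
```

In this example two remaining workers could still flip the result, so MinMax keeps waiting while the two less conservative strategies already stop.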

Algorithm 5 onlineProcessing(Question q)

 1  Set answer = new Set()
 2  Map<Answer, float> result = new Map()
 3  while not all answers are returned do
 4      Answer A = getNextAnswer(q)
 5      answer.add(A)
 6      Set distinctAnswer = getDistinctAnswer(answer)
 7      for i = 0 to distinctAnswer.size-1 do
 8          Answer A = distinctAnswer.get(i)
 9          float confidence = computeConfidence(A)
10          result.put(A, confidence)
11      if canTerminate(result) then
12          break
13  return result



Algorithm 5 outlines the online processing strategy adopted in 
CDAS. The query engine continuously updates the confidence of 
each answer (line 3-13) until the termination condition is satisfied. 
We apply Equation 4 to estimate the confidence of each answer 
(line 9) and apply one of the three termination strategies to decide 
whether to stop the processing (line 11). 
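A runnable Python rendering of the loop in Algorithm 5 is sketched below. It assumes the same form of Equation 4 as above for `confidences`, and it takes the termination test as a callable so that any of the three strategies can be plugged in. The fixed-threshold rule used in the example is a hypothetical stand-in, not one of the strategies proposed in this paper.

```python
def confidences(received, m):
    """Assumed Equation 4: normalized product of (m - 1) * a / (1 - a) per answer."""
    weights = {}
    for accuracy, answer in received:
        weights[answer] = weights.get(answer, 1.0) * (m - 1) * accuracy / (1 - accuracy)
    total = sum(weights.values()) + max(m - len(weights), 0)
    return {answer: w / total for answer, w in weights.items()}


def online_processing(answer_stream, m, can_terminate):
    """Update the confidence of every distinct answer as answers arrive."""
    received = []
    result = {}
    for accuracy, answer in answer_stream:
        received.append((accuracy, answer))
        result = confidences(received, m)   # recompute all distinct answers
        if can_terminate(result):           # stop early when the rule fires
            break
    return result


stream = [(0.54, "pos"), (0.31, "pos"), (0.49, "neu"), (0.73, "neg"), (0.46, "pos")]
# Hypothetical rule: stop once some answer exceeds a fixed confidence.
early = online_processing(iter(stream), m=3, can_terminate=lambda r: max(r.values()) > 0.5)
full = online_processing(iter(stream), m=3, can_terminate=lambda r: False)
print({a: round(p, 3) for a, p in early.items()})  # {'pos': 0.54}
print({a: round(p, 3) for a, p in full.items()})
# {'pos': 0.329, 'neu': 0.176, 'neg': 0.495}
```

The example also illustrates the risk of stopping too early: the naive threshold accepts Positive after a single answer, whereas the full set of answers favors Negative.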

4.3 Result Presentation 

In the onlineProcessing algorithm (Algorithm 5), if there
is an answer that meets the termination condition, online processing
will stop and CDAS will accept the answer. Otherwise, if none of
the answers is good enough, CDAS will update the confidence of
each answer according to Equation 4.

We take queries in TSA as an example to illustrate the result
presentation. Given a list of tweets $t_1, t_2, \ldots, t_X$, let function $h_{t_i}(r)$
return the score of answer $r$ for tweet $t_i$. $h_{t_i}(r)$ is defined as
follows:

$$h_{t_i}(r) = \begin{cases} 1 & \text{if } r \text{ is accepted for } t_i \\ 0 & \text{if another answer is accepted} \\ p(r) & \text{none of the answers are accepted} \end{cases}$$

The percentage of answer $r$ is then computed as $\frac{1}{X} \sum_{i=1}^{X} h_{t_i}(r)$.
Moreover, we generate a set of keywords as reasons for each an- 
swer r. These keywords are the most frequent keywords submitted 
by the workers who have provided the answer r. The results are 
updated as new tweets are being streamed into TSA. 

Figure 4 shows the online processing interface of TSA for the
review results of Kung Fu Panda 2. It summarizes Twitter users'
opinions into three categories. The time window of the query is set 
to 12 minutes and in the elapsed time (4 minutes), 20 tweets are 
fed to TSA, among which 70% of tweets say Kung Fu Panda 2 is 
a good movie. TSA updates the result upon new tweets arriving. 
Users can click an answer to expand the view. TSA will list the 
corresponding tweets for the answer. The tweets are sorted based 
on timestamps from the newest to the oldest. The user can also 
check the progress of the current running HIT. 

5. PERFORMANCE EVALUATION 

To evaluate the effectiveness of the quality-sensitive answering
model in CDAS, we developed two crowdsourcing applications, a
Twitter sentiment analytics (TSA) job and an image tagging (IT)
job. We present comprehensive experimental results for TSA.
Due to space constraints, for the IT application we only provide the
comparison with an online image tagging toolkit.
The results of the other experiments over IT exhibit similar trends
to those of TSA.

By default, our approach applies the probability-based verification
model (denoted as Verification) to select the best answer. For
comparison, the Half-Voting and Majority-Voting models are used
as two alternative verification approaches. Suppose $n$ ($n$ is odd)
workers are employed for a particular task. In the Half-Voting
model, the answer $r_i$ is accepted only if no less than $n/2$ workers
return it as their answer. In the Majority-Voting model, let $v(r_i)$
denote the votes for answer $r_i$. The answer $r_i$ is accepted if for any
other answer $r_j$, $v(r_i) > v(r_j)$.
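The two voting baselines follow directly from these definitions. The Python sketch below is illustrative (the function names are ours); both functions return None when no answer can be accepted.

```python
from collections import Counter


def half_voting(answers):
    """Accept the top answer only if at least half of the workers returned it.

    The text assumes an odd number of workers, so this means a strict majority.
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if 2 * votes >= len(answers) else None


def majority_voting(answers):
    """Accept the answer with strictly more votes than every other answer."""
    ranked = Counter(answers).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie for the most votes
    return ranked[0][0]


table3_answers = ["pos", "pos", "neu", "neg", "pos"]
print(half_voting(table3_answers), majority_voting(table3_answers))  # pos pos
print(half_voting(["pos", "pos", "neg", "neu", "spam"]))             # None
print(majority_voting(["pos", "pos", "neg", "neu", "spam"]))         # pos
```

On the answers of Table 3 both baselines return Positive, matching the first two rows of Table 4.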

5.1 Application 1: TSA 

We deploy TSA on AMT and use 200 movie titles as our queries.
The selected titles are the most recent movies listed in IMDB (Internet
Movie Database). The query follows the format of Q=({movie
name}, accuracy requirement, {Positive, Neutral, Negative}, Oct-1-2011,
1 day). Namely, the queries are processed against one day of
tweets. For each HIT, 30 workers are employed to perform the
review categorization task. We manually check each of the reviews
to generate our ground truth.

5.1.1 Crowdsourcing vs. SVM Algorithm

We first show the advantages of crowdsourcing techniques over
computer programs. We compare the results of TSA with LIBSVM^.
To build an automatic classification model using LIBSVM,
tweet reviews about five movies are selected as the test data, and
tweets about the remaining 195 movies are used as training data. After a
stream of tweets passes the filters of TSA, we also send it to LIBSVM
and collect the corresponding results. We then compare the
results against our ground truth. In TSA, we vary the number of
workers from 1 to 5. Figure 5 shows the accuracies of both systems
for five movies, each with 200 tweet reviews. In most cases,
TSA can achieve a higher accuracy than LIBSVM, even if only one
worker is employed. This indicates that humans are much better at
natural language understanding than machines. For such tasks, if
highly accurate results are required, crowdsourcing is a promising
approach.

^http://www.csie.ntu.edu.tw/~cjlin/libsvm/



5.1.2 Accuracy Analysis 

In TSA, we first apply Theorem 1 to estimate the number of 
workers required. This is a conservative estimation. To reduce cost, 
binary search is used to refine the estimation. Figure 6 compares 
the conservative estimation with the refined estimation generated 
by the binary search. We change the user required accuracy from 
0.65 to 0.99 and find that the refined estimation is less than half of 
the conservative estimation. In the remaining experiments, we use 
the refined estimation to determine the number of workers required 
for each HIT. 
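The refinement step can be illustrated with a short Python sketch. Theorem 1 and the exact accuracy estimate of the prediction model are defined earlier in the paper and are not reproduced here; as a stand-in, the sketch assumes that every worker is independently correct with the mean accuracy `mu` and that the result is correct when more than half of the workers are correct. The parameter `upper` plays the role of the conservative estimate. All names are illustrative.

```python
from math import comb


def majority_accuracy(n, mu):
    """Stand-in estimate: probability that more than half of n workers are correct."""
    need = n // 2 + 1
    return sum(comb(n, i) * mu ** i * (1 - mu) ** (n - i) for i in range(need, n + 1))


def refine_worker_count(required, mu, upper):
    """Smallest odd n <= upper whose estimated accuracy reaches `required`, or None.

    Binary search is valid because, for mu > 0.5, the estimate grows with odd n.
    """
    if upper < 1:
        return None
    lo, hi = 0, (upper - 1) // 2          # candidate sizes are n = 2 * j + 1
    if majority_accuracy(2 * hi + 1, mu) < required:
        return None                        # even the conservative size is too small
    while lo < hi:
        mid = (lo + hi) // 2
        if majority_accuracy(2 * mid + 1, mu) >= required:
            hi = mid
        else:
            lo = mid + 1
    return 2 * lo + 1


print(refine_worker_count(required=0.8, mu=0.7, upper=31))  # 5
```

Under these assumptions, three workers reach an estimated accuracy of 0.784 and five workers reach about 0.837, so the search returns 5 although the conservative bound allowed up to 31.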

We next present the accuracy for the three verification models,
namely Half-Voting, Majority-Voting and our proposed probability-based
Verification model. Figure 7 shows that when the number
of workers increases, we can get a higher accuracy. Among the
three verification models, our probability-based approach achieves
a much higher accuracy than the other two. When 29 workers are
employed, the probability-based model improves the accuracy to
0.99. This verifies the benefit of considering workers' historical
performance.

We proceed to investigate the effectiveness of the three verifi- 
cation models with respect to a user required accuracy. Figure 8 
shows the result. When the requester specifies a required accuracy, 
TSA estimates the number of workers needed to achieve that ac- 
curacy. The real accuracy is computed by comparing the workers' 
answers with the ground truth. The red line in the figure denotes 
the user required accuracy. We observe that the probability-based 
verification model always provides a satisfactory result while the 
results of the other two models are below the required accuracy in 
most cases. 

We can observe that the accuracy of the Half-Voting model is
worse than our estimation. There are two reasons. First, the
estimated number of workers is tied to the workers' mean accuracy. The
mean accuracy used in the prediction model is an overall accuracy,
which is collected across various questions. However, for some
difficult questions, workers' accuracies could be much lower. As
a result, the number of workers needed in voting models is more
than the estimated number. For example, the following tweet about
the movie The Last Airbender expresses a positive opinion, whereas
most workers classify it into the negative category because of the
word "sucks".

My nephew just said that Avatar: The Last Airbender
sucks... I'm disowning him.



The second reason can be explained based on the results of
Figure 9 and Figure 10. Figure 9 shows the percentage of tweets with
no answers in the two voting-based models. In some cases, the
Half-Voting and Majority-Voting models fail to provide a result as
none of the answers is discriminative (all answers get no more than
half of the votes, or two or more answers tie for the most votes).
When the number of workers increases, Majority-Voting can resolve
ties more easily. However, for the Half-Voting strategy, there
are still about 15% of the tweets that cannot obtain an answer with
more than half of the votes. In Figure 10, when we vary
the number of tweet reviews, we observe that the percentage of
no-answer reviews is fairly stable. This phenomenon indicates that
the reviews with non-discriminative answers are almost uniformly
distributed among all reviews.

5.1.3 Online Processing 

One advantage of our crowdsourcing engine is its ability to support
online processing. It can provide an approximate result
without waiting for all the workers to finish their jobs. Specifically,
TSA will generate an initial result as soon as the first answer is
returned. Then it will gradually refine the results as more answers
arrive until the termination condition is satisfied. This allows us to
terminate a HIT and cap the processing cost^.

Figure 5: Crowdsourcing vs. SVM Algorithm

One interesting observation in our experiments is that the
accuracy of the approximate result varies significantly for different
answer arrival sequences. Figure 11 shows the accuracy of the
same HIT under four different answer sequences. The red line is
the user-required accuracy 0.94. Sequence 4 results in a low starting
accuracy because the first two workers of sequence 4 provide
incorrect answers. Therefore, in online processing, we must update
the confidence of the current result dynamically based on the
answers received, as early termination may potentially degrade the
accuracy.

We evaluate the three termination strategies as discussed in Sec- 
tion 4.2.2. Figure 12 shows the effect of early termination on the 
number of workers. The red line denotes the estimated number of 
workers via our refined prediction model. The MinMax strategy 
generates the most conservative estimation, but it still reduces the 
number of workers by 20%. The ExpMax strategy is the most ag- 
gressive one, which can save more than 50% of workers. In Figure 
13, we show the accuracies of the different termination strategies. 
The X-axis is the accuracy requirement specified by the user and 
the y-axis is the real accuracy measured against the ground truth. 
We can see that the MinMax and ExpMax strategies satisfy the user 
required accuracy (denoted as red line) in all cases while MinExp 
fails to meet the requirement at a few points. In view of the need for 
reducing the number of workers while maintaining good accuracy, 
we propose to adopt the ExpMax termination strategy. 



^In AMT, we can cancel a HIT when we detect that the answers
are good enough. By doing so, we do not need to pay workers who
have not yet submitted their answers.



5.1.4 Effect of Sampling 

TSA verifies the answers using the probability-based verification 
model, which relies on workers' historical performance. The AMT 
system records an approval rate for each worker, which implies 
his accuracy in general. However, the workers' approval rates are 
not public due to privacy concerns. To collect the statistics, we 
publish 500 HITs requiring workers to fill in their approval rate. We 
also compute the workers' accuracies of answering TSA queries. 
We observe the distribution of their approval rate in AMT is very 
different from that of real accuracy in TSA, as shown in Figure 14. 
The reasons are two-fold. On one hand, there are various types 
of tasks in AMT and it is natural that people cannot be experts 
in all domains. On the other hand, some requesters set automatic 
approval for all workers without checking the answers. This results 
in a high average approval rate in AMT. Therefore, we adopt a 
sampling approach to estimate workers' accuracy. 

Given $n$ workers, we compute their accuracies $\{a_1^j, a_2^j, \ldots, a_n^j\}$
under a sampling rate $j\%$. We vary the sampling rate and
plot the mean accuracy $\mu^j$ and average absolute error $err^j$ in
Figure 15, where $\mu^j$ and $err^j$ are defined as follows:

$$\mu^j = \frac{1}{n} \sum_{i=1}^{n} a_i^j, \qquad err^j = \frac{1}{n} \sum_{i=1}^{n} \left| a_i^j - a_i^{100} \right|$$

As shown, both mean accuracy and average error are stable when 
the sampling rate is higher than 10%. More precisely, mean accu- 
racy remains nearly constant and average error approaches 0. 

We also study the effect of sampling rate on accuracy in our
verification model. Figure 16 plots the result. We vary the sampling
rate among 5%, 10% and 20% and compare the results to the accuracy
under 100% sampling. The red line represents the user required
accuracy. We can see that the verification has a better accuracy with
a higher sampling rate. When the user required accuracy is lower
than 0.75, all sampling rates are satisfactory. The result meets all
of the user required accuracies only with a sampling rate of no less than
20%. Moreover, the accuracy under the 20% sampling rate has only
a small gap compared to that under 100% sampling. We use a 20%
sampling rate in all of our verification experiments.

5.2 Application 2: IT 

In this experiment, we evaluate our model in the context of an image
tagging application. We use 100 Flickr images as our queries. For
each image, we give a set of candidate tags and ask 30 workers to
choose the related ones. The candidate tags include Flickr tags
and some embedded noise tags.

Again, we first show the advantages of crowdsourcing over
automatic applications in dealing with the image tagging task. We compare our
result with ALIPR^. ALIPR [13] is an automatic image annotation
system which applies a 2-D hidden Markov model and clustering
techniques. The accuracy comparison result is shown in Figure 17.
We use 5 groups of images. Each group contains the top 20 Flickr
images returned by a tag. The figure clearly shows the accuracy gap
between ALIPR and the crowdsourcing approach. ALIPR achieves its
best accuracy of 30% on the tag sun and has only 12.6% accuracy on the tag
apple, whereas in our crowdsourcing system, we can reach more
than 80% even with only one worker employed.

We next study the effectiveness of our model. Recall that our
model first estimates the number of workers for a specified accuracy
requirement and then applies a probability-based model to verify
the result. Figure 18 shows the accuracy achieved with respect
to the user required accuracy. As before, the red line denotes the
user required accuracy. It can be seen from the figure that our model
can always satisfy the user's requirement.



6. RELATED WORK 

The emergence of Web 2.0 systems has significantly increased 
the applicability and usefulness of crowdsourcing techniques. A 
complex job can be split into many small tasks and assigned to 
different online workers. Amazon's AMT and CrowdFlower^ are 
popular crowdsourcing platforms. Studies show that users exhibit 
different behaviors in such micro-task markets [11]. A good incen- 
tive model is required in task design [10]. 

Recently, crowdsourcing has been adopted in software devel- 
opment. Instead of answering all requests with computer algo- 
rithms, some human-expert tasks are published on crowdsourcing 
platforms for human workers to process. Typical tasks include im- 
age annotation [21][18], information retrieval [1][8] and natural 
language processing [3] [12] [17]. These are tasks that even state-of- 
the-art technologies cannot accomplish with satisfactory accuracy, 
but could be easily and correctly done by humans. 

Crowdsourcing techniques have also been introduced into
database design. Qurk [16] [15] and CrowdDB [6] are two examples
of databases with crowdsourcing support. In these database
systems, queries are partially answered by the AMT platform. Our
system, CDAS, adopts a similar design. On top of the crowdsourcing
database, new query languages, such as hQuery [20], have been
proposed, which allow users to exploit the power of crowdsourcing.
Other database applications, such as graph search [19], can be
enhanced with crowdsourcing techniques as well.

One main obstacle that prevents enterprise-wide deployment of
crowdsourcing-based applications is quality control. Human workers'
behaviors are unpredictable, and hence, their answers may be
arbitrarily bad. To encourage them to provide high-quality answers,
monetary rewards are required. Munro et al. [17] showed how to
design a good incentive model to optimize workers' participation
and contributions. Ipeirotis et al. [9] presented a scheme to rank
the qualities of workers, while Ghosh et al. [7] tried to accurately
identify abusive content. Unlike previous efforts, in this paper, we
have designed a feasible model that balances monetary cost and
accuracy, and proposed a crowdsourcing query engine with quality
control. One of the main challenges of our query engine is how
to integrate the conflicting results of human workers. A similar
problem has been well studied in data fusion systems, for
example [4] [14]. We extended the models proposed in [4] [14] to select
and verify the crowdsourcing results in CDAS.

^http://alipr.com/

^http://crowdflower.com/

Figure 17: Crowdsourcing vs. ALIPR

Figure 18: Accuracy Obtained wrt. User Required Accuracy



7. CONCLUSION 

Crowdsourcing techniques allow application developers to har- 
ness the natural expertise of human workers to perform complex 
tasks that are very challenging for computers. However, as humans 
are prone to errors, there is no guarantee for the results of crowd- 
sourcing. In this paper, we introduced the quality-sensitive answer- 
ing model in our Crowdsourcing Data Analytics System, CDAS. 
The model guides the query engine to generate proper query plans 
based on the accuracy requirement. It consists of two sub-models, 
the prediction model and the verification model. The prediction 
model estimates the number of workers required for a specific task 
while the verification model selects the best answer from all re- 
turned ones. To improve users' experience, when verifying the re- 
sults, our model embraces online processing techniques to update 
answers gradually. By adopting the models, CDAS can provide 
high-quality results for different crowdsourcing jobs. In this paper,
we have implemented a Twitter sentiment analytics job and an
image tagging job on CDAS. We used real Twitter data and Flickr
data as our queries. Amazon Mechanical Turk was employed as our
crowdsourcing platform. The results show that our proposed model
can provide high-quality answers while keeping the total cost low.